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Abstract 

The learning dynamics of on-line independent component analysis is analysed in 
the limit of large data dimension. We study a simple Hebbian learning algorithm that 
can be used to separate out a small number of non-Gaussian components from a high- 
dimensional data set. The de-mixing matrix parameters are confined to a Stiefel man- 
ifold of tall, orthogonal matrices and we introduce a natural gradient variant of the 
algorithm which is appropriate to learning on this manifold. For large input dimension 
the parameter trajectory of both algorithms passes through a sequence of unstable fixed 
points, each described by a diffusion process in a polynomial potential. Choosing the 
learning rate too large increases the escape time from each of these fixed points, ef- 
fectively trapping the learning in a sub-optimal state. In order to avoid these trapping 
states a very low learning rate must be chosen during the learning transient, resulting 
in learning time-scales of 0{N'^) or 0{N^) iterations where N is the data dimension. 
Escape from each sub-optimal state results in a sequence of symmetry breaking events 
as the algorithm learns each source in turn. This is in marked contrast to the learning 
dynamics displayed by related on-line learning algorithms for multilayer neural net- 
works and principal component analysis. Although the natural gradient variant of the 
algorithm has nice asymptotic convergence properties, it has an equivalent transient 
dynamics to the standard Hebbian algorithm. 

1 Introduction 

On-line learning algorithms are often used for dealing with very large data sets or in dy- 
namic situations in which data is changing according to a non-stationary process. Inde- 
pendent component analysis (ICA) is often applied under one or both of these condition s 
and a number of on-line ICA algorithms have been developed (see, e.g. lHyvarinenL 1999i) . 
In on-line learning the model parameters are updated after the presentation of each train- 
ing example. Although there is good understanding of the asymptotic properties of on-line 
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learning, much of the learning takes place far from the asymptotic regime and here a the- 
oretical understanding of the process is often poor. Training methods are typically based 
on experimental observations in the absence of a successful theoretical analysis of on-line 
learning in this regime. The efficiency of on-line training is very sensitive to the choice of 
training parameters such as the learning rate and this dependence can slow down learning 
and influence the ability of learning to converge successfully to desired states. A deeper the- 
oretical understanding of the on-line learning process is needed to provide reliable methods 
for setting the parameters and optimising the training process. 

Mo st classical theoretical resu lts on on-line learning are from stochastic approximation 



theory (iKushner and Clark . 



approaches, see e.g. 



Saad 



1978 ). For a review of recent advances and modern theoretical 



(Il998h . Theories describing on-line learning have mainly been 
developed in two different asymptotic regimes. Most work has been done in the limit of 
the long times, in which case the model parameters are close to a stable fixed point of the 
learning dynamics. Here one can work out the asymptotic distribution of the parameters 
for constant learning rate or study their convergence to a fixed point as the learning rate is 
reduced according to some annealing schedule. The conditions under which the parameters 
converge at an optimal rate are quite well understood (White, 1989). Work has also been 
carried out in the limit of small learning rates, in which case the dynarn ics can be shown 
to follow the mean gradient or flow globally ( Kushner and Clarfl 197 8h . Unifying these 
two strands to some extent one can study the long-time global behaviour under an annealed 
learning schedule (Kushner, 1987). 

It is unclear how much practical relevance the classical asymptotic limits have, since of- 
ten in practical applications learning is not asymptotic in the learning times or in the learn- 
ing rate. In this work we pursue a different type of asymptotic analysis by considering the 
limit of large system size. This limit has been studied extensiv ely bv researchers applvini 
statistical mechanics methods to the study of learning systems (lEngel and Van den Broec 
20011). For example, the dynamics of gradient d ecent in multilayer perceptrons (MLPs) 



teiehl and SchwarzeL[l995lisaad and Sollal.ll995h and the dynamics of Sanger s principal 
component analysis (PCA) algorithm (Biehl, 1994, Biehl an d Schlosser. 1998) have been 
studied in this limit. In these examples the dynamics displays interesting and non-trivial 
transient behaviour with unstable, sub-optimal fixed points appearing due to a symmetry in 
the parameter space of the models. The dynamics of natural gradient learning in MLPs has 
been studied using these techniques, showing that the transient fixed point becomes less se- 
rious an d transient learning performance is improved compared to standard online gradient 



descent (IRattrav et all 1199 



; p eriormance is improvea compared to stanaard onlme gradient 
)!, Rattrav and Saad , 1999h . Since a natural gradient algorithm 



has been s hown to provide an asymptotically efficient on-line learning algorithm for ICA 
(lAmariet a l.. 1996, AmarL 1998) we are interested in whether the statistical mechanics 
approach developed here can help understand its transient performance. 

We have recently developed a novel theoretical framework for studying the dynamics 
of on -line ICA in the limit of large data dimension ("Rattravl Eooi 'Rattray and Basalygal 
2003) . We study a Hebbian learning ru le, which is a simple and p opular algorithm for on- 



line ICA with nice stability conditions JHyvarinen and OiaL 19981) and is closely related to 
a popular fixed-point batch algorithm ( Hvvarinen and OiaL 1 19971) . The algorithm is partic- 
ularly amenable to analysis in the limit of large input dimension because it can be used to 
extract a small number of independent components from high-dimensional data. We will 
see later that this allows the dynamics to be represented by a relatively small number of 
variables when the data dimension becomes large. As well as studying the standard version 
of the algorithm we also develop a natural gradient variant and study its dynamics. Because 
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the parameter space is constrained to the Stiefel manifold of orthogon al rectangular matri- 
ces, th e sta ndard equivar i ant or natural gradient ICA algorithms due to lCardoso and Laheld 
(ll996^ and lAmari e t al.' ('l996^ are not appropriate here. Instead it is possible to use the 
ideas developed by Edehnan et al.. (>1999.) in order to construct an algorithm which follows 
the gradient on a Stiefel manifold. Our variant uses the gradien t defin ed by Amari (199^ 
but includes the orthogonalisation term from lilvvarinen and Oial (il998l) in order to keep the 
model parameters on the Stiefel manifold. 

In order to obtain a tractable ICA model, we consider an idealised data set in which a 
small nu mber of non-Gaussian sources are mixed into a large number of Gaussian sources 
( RattravL r2nn2). By using the methods of statistical mechanics, we provide a solution to 
the dynamics of Hebbian ICA in the limit of large input dim ension. This generalises on 
previous results ( Rattray , 2002L iRatt rav and Basalvgal. 2002h . which were limited to the 
simplest single source case (except for the late time asymptotics, which were solved for the 
general case). We find that the transient dynamics of Hebbian ICA can be described as a 
stochastic process in which the system moves through a sequence of metastable fixed points, 
each of which can be described as a multi-dimensional diffusion in polynomial potential. By 
solving the dynamics of the multi-source case we can characterise the symmetry breaking 
process which is critical to performance of the learning process. 

It is interesting to observe that the dynamics of ICA is v ery different fro m the dynamics 
of learning in MLPs and in PCA studied previously ( Saad and Solla>..1995i..Biehl and SchlosseJ. 
E998). In these cases the dynamics was observed to be "self-averaging" so that it followed 
a smooth trajectory in the limit of large data dimension and the learning happened on an 
0{N) time-scale. The ICA dynamics displays significant fluctuations even in this limit and 
learning occurs on a much slower time-scale, typically requiring of the order of or 
iterations depending on the details. We find that the natural gradient variant of the Hebbian 
algorithm has very nice asymptotic convergence properties with uniform convergence in the 
case of equal source statistics. However, the algorithm is shown to have equivalent transient 
performance to the standard algorithm. 

This paper is organised as follows. The data model is described in Section |2l The 
on-line Hebbian ICA algorithm is introduced in Section |3l In Section |4] we introduce the 
macroscopic variables which provide a compact description of the dynamics for large N. 
In Section|5]we show that the learning dynamics near a sub-optimal fixed point close to the 
initial conditions can be considered as a diffusion process. The transient dynamics through a 
sequence of metastable states is analysed in Section|6l In SectionQwe introduce the natural 
gradient version of Hebbian ICA algorithm and study its dynamics. General conclusions 
are made in Section [8l 



2 Data Model 



We consider the following idealised linear data model introduced in iRattrav (l2n02h . The 
A^-dimensional data x is generated from a noiseless linear mixture of a small number M of 
non-Gaussian sources s and a large number — Af of uncorrected Gaussian components, 
n Af{0,lN-M), 

'" s 



X 



n 



(1) 



where A = [Ag An] is the N xN mixing matrix, denotes an x identity matrix and 
J\f{a, S) denotes a Gaussian distribution with mean a and covariance matrix XI. Without 
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loss of generality it can be assumed that the sources each have unit variance. In order to ap- 
ply the Hebbian ICA algorithm we assume also that the data is already sphered, i.e. the data 
has zero mean and an identity covariance matrix. This can be achiev ed for on-line leam- 
ing by an adaptive sphering algorithm, such as the one introduced by Cardoso and Laheldl 
(11996). These model assumptions lead to the constraint that A must be an orthogonal ma- 
trix, e.g. 



[A 



s -An] 



A^ 



A^ 

\At^ Ar 



AsAg + AfiA^ — 


I, 




4T4 4T4 




10' 


4T4 4T4 




1 



(2) 



(3) 



3 On-line Hebbian ICA Learning Rule 

The goal of ICA is to find the de-mixing matrix W such that the projections, 

y 



(4) 



will coincide with the non-Gaussian sources s up to scaling and permutations. The best 
possible solution is one in which the K projections will learn as many as possible of the 
M non-Gaussian sources. Note that specialisation of each projection to a particular source 

mostly depends on the details of the initial condi tions. 

We consider a simple Hebbian learning rule ( Hvvarinen and Ojal 1998h . which extracts 
non-Gaussian sources from the data by maximising some measure of non-Gaussianity of 
the projections. The change of the N x K de-mixing matrix W at the each learning step is 
given by, 

AW = r] x(f){yf(T + aW{I -W^W) . (5) 
Here, r/ is the learning rate and a is a K x K diagonal matrix with elements 

an = sign (E^^ [si4>{si) - 4>'{si)]) , (6) 

which ensures stability of the correct solution as yi Si dHvvarinen and Oial 1998h . The 
first term on the right of ^ maximises some measure of non-Gaussianity of the projections. 
The second term provides orthogonalisation of the de-mixing matrix. The choice of learning 
rate rj greatly influences the performance of this algorithm. Choosing too large a learning 
rate results in slow and inefficient learning but choosing too high a value may result in the 
learning dynamics becoming trapped in a poor solution, as we will see later. The learning 
dynamics is less sensitive to the choice of the parameter a and we set a = 0.5 in simula- 
tions. The function cf){y) = [(/>(yt)] is some smooth non-linear function which is applied to 
every component of the vector y. An even non-linearity, e.g. 0(y) = y^, is usually used 
to detect asymmetric non-Gaussian signals, while an odd non-linearity, e.g. (j){y) = or 
(/){y) = tanh(y), is used to extract symmetric non-Gaussian signals. 



4 Learning Dynamics for Large Data Dimension 

In the case of high-dimensional data (when becomes very big) it is difficult to analyse 
the dynamics of the N x K de-mixing matrix. In order to provide a compact description of 
the system dynamics in the limit — > 00, we introduce new macroscopic variables 

R = W^A, and Q = W^W , (7) 
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the dimension of which are K x M and K xK respectively. These overlap matrices contain 
all necessary information about the relationship between the projections y and the sources 
s. Therefore the system can be described by a small number of macroscopic quantities in 
the limit of large N as long as K and M remain small. We will usually order the indices 
retrospectively so that the dynamics approaches the optimal solution with Rij = 6ij (as- 
suming orthogonal W such that Q = I). However, it should be remembered that there are 
equivalent optima related to this solution by a permutation of indices or changes in sign of 
the components of R. In order to learn successfully the algorithm has to break symmetry 
and specialise to a particular solution. 
Using Equation ^ one can show that, 

y = W^{AsS + Ann) = Rs + z, (8) 

where z ~ AA(0, C) and C = Q — RR^. In order to analyse the dynamics we will have 
to compute expectations with respect to the distribution of y. The covariance matrix C is 
symmetric and positive definite so that we can always write z in the form z = Lfi where 
fj, ~ AA(0, /) are Gaussian variables and the matrix L can be found by special decomposi- 
tion. A p articul arly useful decomposition for our purposes is the Cholesky decomposition 
(see, e.g. IPress et al.. 1992) in which case C = LL^ where L is lower-triangular with 
non-zero components satisfying the following recurrence relations 



L 



kk 



\ 



Ckk-Y.^lj (9) 



i-l 

M = ; , (10) 



and for the A;-th row of L we have 

^ Cki — X^j=i LijLkj 

for i = 1,2, k — 1. 

From Equation ^ we can calculate the changes in R and Q after a single learning step, 

AR = 7]a(t){y)s^ + a{I - Q)R , (11) 

AQ = r^a{I + a{I-Q))^{y)y^ + a\l-QfQ 

+ rj<^y<t){yf{I + a{I-Q)) + 2a{I-Q)Q 

+ rfct){y)x^xct>{y)^ , (12) 

where we used the constraint in Equation ^ to set Ag = . 

The dynamics is not very sensitive to the exact value of a as long as a ^ r]. We will 
see later that the learning rate must be chosen very small for large N so that typically we 
will have this situation in practice. As a increases relative to r/, Q approaches / since the 
orthogonalisation term in Equation ^ dominates. If one defines Q — I = q/a and sets a 
large relative to rj then the fixed point of Equation ( fT2t to leading order is, 

q = \[ria {cl>{y)y^ + ycl){yf) + 7j^Ncl){y)ct>{yf] , (13) 

where we have dropped terms lower than 0{rf'N) and 0(r/). Substituting Equation (fT3l 
into Equation (II It leads to the following update equation for R, 

AR = r]a- [^(y)^^ - 1 (0(y)yT + r] _ i rj^Nct){y)ct){yf R . (14) 
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7? = 3.5 X 10-^ r] = 0.0025 r] = 0.005 




Figure 1 : Typical learning dynamics of the Hebbian ICA algorithm for different learning 
rates. We used the non-linearity (p{y) = to extract three asymmetrical binary sources 
with skewness K3 = 1.5 from 100-dimensional data {K = M = 3, N = 100). Each 
picture shows the quantity J2ij I ^ij I it changes over time. For the smallest learning rate 
(rj = 3.5 X 10~^) the dynamics looks relatively smooth but it takes time to learn all three 
sources and the dynamics is localised at a sub-optimal metastable state near \Rij\ = 2 
for a significant period of time. With a larger learning rate (rj = 0.0025) the dynamics 
appears more stochastic and the system is again localised in the same metastable state. For 
the largest learning rate (rj = 0.005) the fluctuations are so strong that the system remains 
trapped in a sub-optimal state close to the initial conditions for the entire simulation time. 



This simplifi cation procedure is an example of adiabatic elimination of fast variables (see, 
for example, Gardiner . IQSS"). In a more rigorous treatment one should consider the mean 



and covariance of AR and AQ using the appropriate large N scalings which ai^e described 
in the next sections. One finds that fluctuations in Q are negligible in the limit of large N 
and therefore the dynamics of Q can be described by a differential equation in this limit 
with stable fixed point given by Equation (fT3l . 

In Figure ^ we show some typical dynamical trajectories. We observe the following 
types of fixed points in the i?-dynamics: 

• Optimal fixed points (see Figure [2(a) and (b)) 

Asymptotically, when yi Si, the optimal solution is given by R*- = 6ij, which is 
a fixed point of Equation (fT4l as r] —>■ 0. Note that all other possible solutions can be 
obtained by a trivial permutation of indices and/or changes in sign. Asymptotically 
the learning rate should be annealed in order to approach this fixed point at an optimal 
rate. For a detailed account of the asymptotic dynamics of the Hebbian ICA algorithm 
under an annealed learning rate, see Rattray (2002). 

• Sub-optimal fixed points causing trapping near the initial conditions (see Figuren(c)) 

Due to the 0{if') fluctuation term in Equation (I14t . the algorithm has a special class 
of sub-optimal fixed points near i? = which causes the presence of a stochastic 
trapping state near the initial conditions. We will discuss this situation in detail in the 
next Section. 

• Transient fixed points (see Figure Q (a) and (b)) 

When T < M non-Gaussian sources have been learned, the dynamics can become 
locaUsed in metastable states somewhere between the initial conditions and the final. 
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optimal fixed point. In tiiis case tlie fixed points of Equation (fT?t are 



R*j = \[i<T], (15) 

where 

If r di tel = I ^ if predicate is true, 

\ if predicate is false. 

Again, similar fixed points can be obtained from (fTSl by a simple permutation of in- 
dices. In this case the system has to leave such metas table states in order to complete 
learning. We will analyse the dynamics near these metastable states in Section|6l 

5 Stochastic Trapping State near the Initial Conditions 

In similar studies of on-line learning, macroscopic quantities like the overlap R usually have 
a "self-averaging" property such that the variance of these macroscopic quantities tends to 
zero in the limit N ^ oo (see e.g. Saad and S oUa. 1995. Biehl and Schlosser. 1998). A 
random and uncorrected choice for A and the initial entries of de-mixing matrix W leads 
us to expect R = 0{N~^/'^) initially. Larger initial values of R could only be obtained 
with some prior knowledge of the mixing matrix which we will not suppose. In this case 
one can no longer assume that fluctuations are negligible as oo. Moreover, as we 

will see below, the mean and variance of the change in R at each iteration are of the same 
order. That means that the overlap R does not self-average and the fluctuations have to be 
considered even in the limit. Therefore, it is more natural to model the on-line learning 
dynarnics near the initial conditions as a diffusion process (see, for example, Gardineii 
1985 . van KampenL 1992h . In order to establish a clear picture of the dynamics we have to 



choose an appropriate scaling for macroscopic quantities and learning parameters. In the 
following discussion we set r = R^/N where r is assumed to be an 0(1) quantity. 

5.1 Diffusion in a Potential 

For a diffusion process the probability density p(r, t) of a random variable r = [rij] at time 
t obeys the Fokker-Planck equation 



dt ^ drij 2 f-; dvijdrM 



(17) 



where the coefficients Aij, which are called "drift" coefficients, represent the expectation 
of the change of variable r^j with respect to the stochastic process in question. They can be 
written as 

yl,,■=E[Ar,,] = -^^iV-^ (18) 

where and p are the input dimension and the scaling order for our system and U{r) is 
some differentiable function of r. This function is analogous to the potential function for 
the case of a particle undergoing a diffusion in a potential. The coefficients Bijki are the 
covariance of the change of variable rij, 

BijM = Cov[Ar,j-, Am] = EiAnjAm] - E[Arij]E[Arki] = DijkiN-P, (19) 

where -Djjfci is called the "diffusion term". Usually this would be a matrix but notice that 
in our case the dynamical variables are in a matrix and therefore the diffusion term has four 
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odd, K4 / 0(y) even, K3 / 




Figure 2: Close to the initial conditions the learning dynamics is equivalent to diffusion in 
a polynomial potential. For symmetrical source distributions with non-zero kurtosis K4 we 
should use an odd non-linearity in which case the potential is quartic, as shown on the left 
(for K = I, M = 2). For asymmetrical source distributions with skewness K3 we can use 
an even non-linearity in which case the potential is cubic, as shown on the right. Escaping 
over the minimal barriers results in symmetry breaking, i.e. specialisation to a particular 
source. 



indices. However, we can think of each pair as a single index in a vectorised system. If we 
do so then for our case the diffusion term can be considered a diagonal matrix of magnitude 
D, 

Dijki = D 5ik6ji, (20) 

where D is called the "diffusion coefficient". It is typical for the diffusion process that 
the mean and covariance of the random variable are of the same order ~ 0{N^p). This 
means that, for large A^, the system is equivalent to diffusion in the potential U{r) with a 
characteristic time-scale (dt)^^ = N^. 

In our case the potential U (r) has a minimum at r^j = surrounded by potential 
barriers of differing heights. Examples are shown in Figure |2l for an odd and even non- 
linearity on the left and right respectively. The escape time (the mean first passage time) 
from the minimum of the potential at r = is mainly determined by the effective size of 
the minimal potential barrier AU (see, for example. fOardiner. . 1 985- .van Kampen. .1992) . 

^'escape = A GXp ' 1 ) 

where prefactor A is proportional to the characteristic time-scale and depends on the curva- 
ture of the potential. Because diffusion through a higher potential barrier is exponentially 
less Ukely as the difference between barrier heights increases, the escape time will effec- 
tively be inversely proportional to the number of escape points (the number of barriers with 
the same minimal height AU) when the Arrhenius factor AU /D is large. 

In the next two sections we will find the form of potential barrier and estimate the escape 
time from the trapping state for the two main classes of non-linear function 4>{y). 

5.2 Odd Non-linearity 

If we need to extract symmetrical non-Gaussian signals from the data then we have to use 
an odd non-linearity, e.g. = or cl){y) = tanh(y) are common choices. In this case 
the appropriate scaling for the learning rate will be r/ = uN^^, where u is an 0(1) scaled 
learning rate parameter. After expanding Equation near r = we obtain the following 
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expressions for the mean and covariance of the change in r at each iteration, 

ElAnj] = (^-l{4>\^^))u% + '^Ki{^"'{fi))auurfj) N'^ + 0{N-^), (22) 
Cov[Ar,„ Am] = {<pHf^)y S.kSji iV^^^ + 0{N-^), (23) 

where is the fourth cumulant of the j-th source distribution (measuring kurtosis) and 
brackets denote averages over a Gaussian variable /x M{0,I). In this case the system 
can be described by a Fokker-Planck equation for large N with a characteristic time-scale 
{6t)~^ = N^. To compute the expectations we have made use of the Cholesky decomposi- 
tion of y as described in Equations ^ and (flOt . This allows us to remove the dependence 
of the parameters in the Gaussian averages by writing y = L/j, where the lower diagonal 
(i.e. non-zero) elements of L are found to be, 



Lij = 6ij - ( ^6ij }_^rf^ + }_^ Ti^rjm 1 iV ^ + 0{N 2) for i > j. 

\ m m / 

The dynamics is equivalent to diffusion in the following potential 

K M 

U{r) = E E (3('^'(^)) ^'4 - ^Ai^"'if^)Wu^rf^) (24) 
i=i j=i 

with a diagonal diffusion matrix of magnitude D = . Notice that the potential 

is a sum of contributions U{r) = J2ij Ui'^ij) which depend only on a single element in 
r. Since the diffusion matrix is diagonal, this means that each element of r under goes an 




independent diffusion process equivalent to the one-dimensional case described by iRattrav 
The effective size of the potential barriers for each element rij is given by. 



AUin,) _ 3(</.2(^))i. 



D 8|4(,/.'"(/z))| 



(25) 



Escape over one such potential barrier results in the corresponding source j being learned 
by projection i. This breaks the symmetry of the system as one projection specialises to a 
particular source. Once this happens the system can again become trapped in a metastable 
state with other sources remaining unlearned. The dynamics in the neighbourhood of this 
more general class of fixed point is described in Section|51 

The shape of the potential for an example with two sources (with equal kurtosis) and 
one projection {K = 1, M = 2) is shown on the left of Figure|2l In this case we have four 
minimal potential barriers which the system has to overcome. For the special case when all 
non-Gaussian sources have the same kurtosis {i = 1, 2, M), the escape time, 

i.e. the mean first passage time for a single source to be learned, is given by. 



J escape 0^ 2M K ^ 



?|k4(0'"(^))| 



(26) 



where a = sign{{(j)"' {/j,)) K4) is a necessary condition for successful learning. Notice that 
the time-scale for escape diverges with the learning rate. On the left of Figure |3l we show 
numerical simulations for this situation. We have generated data of dimension = 50 with 
M non-Gaussian sources each with a uniform distribution and we extract a single projection 
K = I. The figure shows the dependence of the escape time on the number of non-Gaussian 
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(t){y) = y^,u = 1, K4 = -4/9 



(piy) =y'^,u = 2, Ks = 1.5 




Figure 3: Dependence of the escape time (time to learn at least one source signal) on the 
number of non-Gaussian sources M for Hebbian ICA with one projection (K = 1). The 
solid line shows the slope predicted by the theory for comparison. The points with error 
bars denote the simulation results for = 50 averaged over 50 experiments. 



sources, on a log-log plot. The solid line shows the inverse scaling with M predicted by the 
theory and this is consistent with the simulation results. A source was considered learned 
when the associated overlap Rij was observed to be greater than a threshold magnitude (0. 1 
greater than the position of the potential barrier). 

If the sources have different kurtosis then the algorithm will be most likely to learn the 
source with highest kurtosis first since this will correspond to the escape point with the 
lowest potential barrier. The mean first passage time will be dominated by the contribution 
from this barrier for large learning rates and the escape time will not depend on the number 
of sources. 



5.3 Even Non-linearity 

If the non-Gaussian signals are asymmetrical, then we can use an even non-Unearity, for 
example (f){y) = y"^. In this case the appropriate scaling for the learning rate is = uN^^I"^. 
After expanding Equation (fUTi near r = we find that the mean and covariance of the 
change in r at each iteration are given by (to leading order in A^^^), 

K 

Cov[Ar,„ Ar^] = {4>''{l^)>^ S^kdjiN'^ (28) 

where Kg is the third cumulant of the j-th source distribution (third central moment), which 
measures skewness, and brackets denote averages over Gaussian variables fi ~ AA(0, i"). 
Again the system can be described by a Fokker-Planck equation for large N but now with 
shorter characteristic time-scale {6t)^^ = N"^. The system is locally equivalent to a diffu- 
sion process in the cubic potential 

KM K 

i=l j = l lyti 

with a diagonal diffusion matrix of magnitude D = {(fp'{p))v'^ as before. In this case the 
sum in the potential contains "cross-terms" which depend on more than one element in r. 
The dynamics is therefore not equivalent to the one-dimensional case and features of the 
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5 10 15 20 ^ 5 10 15 



Figure 4: Dependence of effective size of barrier of number of projections K for the specific 
choice of the even non-linear function (pifJ-) = /"^ is shown on the left. Dependence of the 
escape time on learning rate u for the one-dimensional Hebbian ICA (K = M = I) is 
shown on the right. The solid line shows the trend predicted by the theory. The points 
with error bars denote the simulation results for K3 = 1.5 and = 50 averaged over 50 
simulations. 



potential will depend on the particular value of K and M considered, making analysis less 
straightforward than for the odd non-linearity. 

The shape of the potential for an example with two sources of equal skewness and one 
projection (K = 1, M = 2) is shown on the right of Figure |2l In this case we have a ledge 
in the potential with two points of minimum height AU (escape points). In the general case 
we find that the effective size of the minimal barriers is given by, 

AU f(K) 1/2 

= — — — (30) 

where the function f{K) has a complex form which depends on the choice of non-linear 
function For example, the shape of this function for (pifJ-) = /"^ is shown on the 

left in Figure |4] We see that the size of potential barrier decreases with increasing number 
of projections K and this appears to be a general feature of the function. This suggests 
that parallel algorithms for extracting asymmetrical signals may prove more efficient than 
deflationary ones which separate one signal at a time. 

For the case of a single projection {K = 1) the potential does decompose into a sum of 
independent terms and each component of r evolves independently. For the case of sources 
with equal skewness Kg = K3 (i = 1, 2, M), the mean first passage time, i.e. the time 
until one of the source signals is learned, is then given by. 



Ar2 

escape ^ f 



1 f (<A^(/x))z. 



12 V^3(<^"(/«)) 



(31) 



Numerical results from simulations of this scenario with = 50 are shown on the right of 
Figures|5]and|3] The asymmetrical sources used in the simulations are binary and each have 
the same skewness with Ats = 1.5. In Figure|3]we show the escape time as a function of the 
number of sources M on a log-log plot. The solid line shows the inverse scaling predicted 
by the theory and is consistent with the experimental results. On the right of Figure |4] we 
show how the escape time (on a log scale) depends on the learning rate parameter v. The 
slope of the solid line is the theoretical prediction from Equation dSTT i and is consistent with 
the simulation results. 
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Figure 5: Transient dynamics of the Hebbian ICA algorithm for K = M = 4 and N = 50. 
The left plot shows the simulation results for a single run. Dependence of time required to 
learn next source signal on the number of unlearned sources is shown on the right plot using 
a log-log scale. The solid line shows the trend predicted by the theory. The points with error 
bars denote the simulation results averaged over 10 simulations. 



6 Transient Dynamics 

Consider the more general situation when T < M non-Gaussian sources have already 
been learned by the system. The corresponding fixed points of Equation (I14t are R*j = 
6ij l[i < T], where l[i < T] is defined by (fT6b . We have already considered the special 
case when T = which is appropriate close to the initial conditions when no sources have 
yet been learned. A typical learning dynamics proceeds by passing through these states one 
by one until all the sources are learned. We show such a dynamical trajectory on the left of 
Figure |5l The indices have been labeled retrospectively so that the sources are learned in 
order although this labeling is clearly arbitrary and only chosen for notational convenience. 
It appears that the typical time to learn each source increases as more sources are learned, a 
phenomenon which is explained below. 

We introduce new 0(1) scaled variables, 

u = ri N'^, V = {R- R*)^, (32) 

where A'^ is the input dimension and d is the scaling order for the learning rate. For the 
case of an even non-linearity we set d = | while for the case of an odd non-linearity we 
choose d = 2. In our new variables the fixed point is = 0. We can compute the mean 
and covariance of these variables as we did for the r variables close to the initial conditions 
in the previous Section. In the present case it is convenient to consider four categories of 
variables separately. 

L i<T,j< T. 

Coy[Avij, Avh] = 0{N~P) . (33) 

2. i>T, j <T 

E[Avij] = -l^jVijuN^-P + OiN-P) , Cov[Avij, Avki] = 0{N-p) . (34) 

3. i<T,j>T 

E[Av,j] = -Cmji^N^-P + 0{N-P) , Cov[Avij, Avki] = 0{N-p) . (35) 
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4. i>T, j >T 



dvij 

Coy[Av,j, Avki] = D 6ik6jiN-P + OiN-P-') . (36b) 

In these equations, p is the scaling order of our system with p = 2 for the even non-linearity 
and p = 3 for the odd non-linearity. We define, 

= an {Es^[si(j){si) - (l)'{si)]) . 

In Equation (I36bt the exact expressions for the potential U{v) and for the diffusion coeffi- 
cient D have the same form as those which were found in Section|5lgiven by Equations (l22l . 
(El, (EUl and (EH. 

In the first three groups of variable we observe that the fluctuations are of negligible 
order, so that the dynamics can be described by linear differential equations in the large 
N limit with a relatively fast time-scale of 6t^^ = N^^^. Equations (l33l . (l34l i and (l35l 
therefore converge exponentially to the fixed point as long as the condition in Equation ^ 
is met. However, the variables in the fourth group display a similar diffusive dynamics to 
that considered in the previous Section. The dynamics for these variables is completely 
equivalent to the r-dynamics close to the initial conditions. Therefore we observe the same 
behaviour in these variables, with localisation at the fixed point until one component escapes 
over the potential barrier resulting in another source being learned. Once all the sources are 
learned we effectively have T = M and only the first three groups of variables above 
remain. In this case we can increase the learning rate to an 0{N^^) quantity in principle 
without the stochastic effec ts dominating, but then the learning rate should be annealed as 
described by'Rattr^ (12002 ) in order to converge asymptotically to the optimal solution. 



The picture on the left in Figure |5]is a good illustration of the transient dynamics. We 
show numerical simulations for the typical dynamics of a Hebbian ICA algorithm extracting 
four (K = M = 4) uniformly distributed non-Gaussian sources from an = 50 dimen- 
sional data set. On the right we show a log-log plot of the time At"""'"' required to learn the 
next source signal in the case when i non-Gaussian sources have already been learned by 
the system and we see that it is consistent with the expected trend shown by the solid line. 
The total learning time for extracting all the non-Gaussian sources in this case will be 



M-l 

= E ^^r' oc exp ^ ^nj— W ' (37) 



1=0 

where AU/D is given by Equation (EH- 



7 Natural Gradient ICA 



Natural gradient algorithms have been developed for ICA which use the structure o f the pa- 
rameter space to define a Riemannian gradient de scent direc tion ( Amari et al.', 1996h . Along 
with closely related relative gradient algorithms (^ Cardoso'and Laheld.. .l996) these methods 
provide some advantages over standard gr adient descen t methods, such as greater simplic- 
ity, robustness and asymptotic efficiency ( AmariL I998h . However, these algorithms have 
mainly been defined for the special case where the mixing matrix is square and invertible. 
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The algorithm we use here searches the space of tall thin orthogonal matrices. This 
allows it to extract a relatively small number of independent components from a high- 
dimensional data set possibly containing Gaussian components. Standard natural gradient 
ICA algorithms are not appropriate in this case and we therefore need a different approach. 
One possibility would be t o use a pa rameterisa t ion of the set of orthogonal matrices. This 
approach is considered by iMoon and Gunther ( 2002 ) who provide an interesting reinter- 



pretation of natural gradient as a pullback. This allows them to define natural gradient 
algorithms for various structured matrices. Although they restrict their attention to square, 
invertible matrices their ideas could be extended to tall thin matrices. However, the avail- 
able parameterisations appear to be quite complex in this case and computing the gradient 
even more so. 

The approach of *Moon and Gu ntheil (l2002 h is to use a set of coordinates which are 
intrinsic to the manifold. An alternative approach is to use the original variables subject 
to constraints, i.e. work in the space of tall thin matrices W but impose the orthogonality 
constraint, 

W'^W = I . (38) 

This is the approach taken by 'Edelman et al and it leads to a mu ch more straight- 

forward gradient definition for ICA which is described by lAmaril (Il999l) . The constraint 
surface is known as a Stiefel manifold and for a function F{W) defined on the Stiefel 
manifold, the "natural" gradient of F at the point W of the manifold is defined by 

VwF = ^wF - WVwF'^W, (39) 

where the standard gradient Vvi/-F is the K-hy-M matrix of partial derivatives of F with 
respect to the elements of W . The loss function used in Hebbian ICA is some non-quadratic 
function of the projections F{y) and the standard gradient of this function is given by 

VwF = X(f){y)^(T . (40) 

Then, accor ding to Equation (I39t . the natural gradient of this function on the Stiefel mani- 
fold will be (lAmariLil999.') . 



VwF = x<P{yfcT - Wa<P{y)y'^. (41) 

A disadvantage of using non-intrinsic variables is that the algorithm is not guaranteed to stay 
on the manifold. This is especially problematic for stochastic gradient algorithms which 
only approximately follow the gradient direction. We therefore add the same orthogonalisa- 
tion term used in the standard Hebbian algorithm. The natural gradient algorithm then has 
the following form, 

AW = f] [xcf}{yfo- - WcT^{y)y^] + aW{I - W^W) . (42) 

The update increment for the overlap matrices R = Ag and Q = W'^W at every 
learning iteration is found to be, 

AR = r]{(Tct>{y)s'^ -y^iyfaR)+a{I-Q)R, (43) 

AQ = i]a{I-Q + a{I-Q)-a{I-Q)Q)ct>{y)y^ + a^{I-QfQ 

+ r]aycj){y)^ {I - Q + a{I - Q) - aQ{I - Q)) + 2a{I - Q)Q 

+ rfcl){y)x^xcl){yf . (44) 
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(t){y) = y3, K4 = -4/9, u = 1 




1 1 N'^ number of unlearned sources 

Figure 6: Transient dynamics of the natural gradient Hebbian ICA algorithm for K = 
M = 4 and = 50 (compare with Figure [Sjl. The left plot shows the simulation results 
for a single run. Dependence of time required to learn next source signal on the number 
of unlearned sources is shown on the right plot using a log-log scale. The solid line shows 
the trend predicted by the theory. The points with error bars denote the simulation results 
averaged over 10 simulations. 



After adiabatic elimination of the Q variables by a similar procedure as we carried out for 
the case of Hebbian ICA (see Section|3ll we have the following dynamical equations for the 
overlaps, 

AR = r]a ((/.(y)^^ - ycl>{yf R) - ^7]^Nct>{y)cl>{yf R . (45) 

The typical learning dynamics of this natural gradient version of Hebbian algorithm is 
shown on the left of Figure|51 where we used the odd non-linearity (p{y) = to extract four 
symmetrical sources with kurtosis K4 = —4/9 from 50-dimensional data (K = M = 4, 
= 50). 

Following the procedure outlined in Section |6l we can expand near the general fixed 
points with T < M sources learned. Using the same variables we expand around v = 
and find the following results 

1. i<T,j< T. 

E[Avij] = [- iC^ + VijuN^-P + OiN-P) , Cov[Av^j, Avki] = O(A-p) . 

(46) 

2. i>T, j <T 

E[Avij] = -ijVijuN^-P + O(A-P) , Cov[Avij, Avki] = 0{N-p) . (47) 

3. i<T,j>T 

E[Av,j] = -Cmji^N^-P + 0{N-P) , Cov[Avij, Avki] = 0{N-p) . (48) 

4. i>T, j >T 

E[Avij] = -^^^N-P + 0{N-P-') , (49a) 
Coy[Avij, Avki] = D 6^kSjiN-P + OiN-P^^). (49b) 
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As before, p = 2 for the even non-linearity and p = 3 for the odd non-linearity and the 
potential in the last case is the same as for standard Hebbian ICA. We see that the equations 
for the first set of variables (i < T, j < T) in Equation (l46t are simplified in compari- 
son to Equation d33l and no longer contain cross-terms. This means, for example, that the 
algorithm will enjoy uniform asymptotic convergence if all sources have identical statis- 
tics and M = K. Generally speaking the eigenvalues determining the convergence of the 
variables in the first three groups have lower variance and the asymptotic convergence of 
the natural gradient algorithm will be faster than that of the Hebbian algorithm. However, 
Equations (I49bl) and (I36bl) are identical and therefore the transient dynamics of the algo- 
rithms will be very similar. These are the variables which provide the rate limiting factor 
and learning time-scale during the transient. 

The plot on the right of Figure |6l shows the escape time from each of the transient 
fixed points encountered during the dynamics. These simulation results confirm that the 
transient dynamics is very similar to the standard Hebbian algorithm results (see Figure |5ll 
as predicted by our theory. 



8 Conclusion 

The dynamics of on-line ICA learning has been studied in the limit of large data dimension. 
We have analysed a Hebbian learning algorithm which is appropriate for extracting a pre- 
scribed number of components from high dimensional data possibly containing Gaussian 
components. We also studied a natural gradient variant of the algorithm which uses the 
gradient defined on the Stiefel manifold of orthogonal matrices. 

We find that the learning time-scale of both algorithms is mainly determined by the tran- 
sient dynamics. Learning takes place by a sequence of symmetry breaking steps in which 
a new source is learned and these steps can be described as a diffusion and escape process. 
The learning time-scale is found to be longer than expected from the analysi s of related algo- 
rithrns such as on-line back-prop agation and Sanger's PCA algorithm (e.g. ISaad and SoM 



1995 . Biehl and Schlossei . 1998h . To learn each symmetric source typically requires of the 



order of A'^^ learning iterations while to learn an asymmetrical source using an even non- 
linearity typically requires of the order of N"^ learning iterations. Both algorithms exhibit 
equivalent transient dynamics and we only find an advantage in using the natural gradient 
variant asymptotically. 
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